Record: SP8192 + Recur345 + Par7 + EMA + QK5.25 + Pre-Quant TTT 10ep - val_bpb 1.0600 (3-seed mean) #1487
Open
ndokutovich wants to merge 1 commit into openai:main
… - val_bpb 1.0600 (3-seed mean)

Tuned variant with QK-Gain 5.25, 10-epoch TTT (lr=0.00045, freeze 1 block).

seed 42: 1.06023436
seed 1337: 1.05980538
seed 2024: 1.06010381
mean: 1.06004785 (std 0.0002)
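The reported mean and spread can be re-derived from the three per-seed values above. This is a minimal check, not the submission's own evaluation code. The commit message does not say whether the sample or the population standard deviation was used, so both are computed here; the variable names are illustrative.

```python
import statistics

# Per-seed val_bpb values as reported in the commit message.
seed_bpb = {42: 1.06023436, 1337: 1.05980538, 2024: 1.06010381}

values = list(seed_bpb.values())
mean_bpb = statistics.fmean(values)

# The message does not state which form of std was used, so compute both:
# sample (n - 1 denominator) and population (n denominator).
sample_std = statistics.stdev(values)
population_std = statistics.pstdev(values)

print(f"mean: {mean_bpb:.8f}")
print(f"std (sample): {sample_std:.4f}, std (population): {population_std:.4f}")
```

Both forms round to the same four-decimal figure, so the reported "std 0.0002" is consistent with either convention.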
owizdom added a commit to owizdom/parameter-golf that referenced this pull request on Apr 9, 2026
…nthesis (validation pending)

First submission to stack three independently-legal val-data adaptations on the PR openai#1487 (1.0600) base:

1. Pre-Quant AdamW TTT pushed to 11 epochs with freeze_blocks=0 (Track A)
2. Val-Calibrated GPTQ: Hessian H=X^T X computed from validation activations to align quantization with the eval distribution (novel on the modern stack; PR openai#1019 ablated this on its older base only)
3. Eval-Time Legal Score-First TTT, 2 epochs with score-before-update ordering (Track B, builds on PR openai#1493)

The three knobs attack the 0.0187 BPB quantization gap measured in PR openai#1487 (1.0415 post-prequant-TTT FP -> 1.0602 post-quant sliding) from independent angles. PR openai#1487's eval_val_ttt code path is unchanged but enabled via env vars.

Code diff vs PR openai#1487 base: 186 lines (~100 added in new collect_hessians_val function, plus 8 hyperparameter defaults flipped). Architecture, optimizer, training loop, EMA, and quantization machinery are byte-identical to PR openai#1487.

Projected val_bpb range: 1.0452 - 1.0542 (center 1.0497), which would clear the 0.005-nat SOTA threshold over PR openai#1487. Worst case ~1.054 (still strong non-record).

py_compile clean. 3-seed validation requires ~$15-25 of 8xH100 SXM time on RunPod; see VALIDATION.md.

Compliance: Track A (artifact-baked val-data adaptation) + Track B (eval-time score-first TTT). No SLOT, no n-gram cache, no ETLB.

Credits: PR openai#1487 ndokutovich, PR openai#1493 bigbag, PR openai#1019 abaybektursun, PR openai#1394 clarkkev, PR openai#1413 dexhunter, PR openai#549 abaybektursun, PR openai#1412 Robby955, PR openai#1204 msisovic, PR openai#1423 aryanbhosale, PR openai#1445 X-Abhishek-X.
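To make the "Hessian H=X^T X" step concrete, here is a toy sketch of accumulating that matrix over batches of layer-input activations. It is not the collect_hessians_val function from the commit, whose code is not shown here. The helper names `accumulate_hessian` and `collect_hessian` and the toy activations are assumptions for illustration, and plain Python lists stand in for the tensors a real implementation would gather from the model.

```python
from typing import List

Matrix = List[List[float]]

def accumulate_hessian(hessian: Matrix, activations: Matrix) -> None:
    """Add X^T X for one batch (rows = tokens, columns = features) in place."""
    dim = len(hessian)
    for row in activations:
        for i in range(dim):
            for j in range(dim):
                hessian[i][j] += row[i] * row[j]

def collect_hessian(batches: List[Matrix], dim: int) -> Matrix:
    """Sum X^T X over all calibration batches for a single layer input."""
    hessian = [[0.0] * dim for _ in range(dim)]
    for batch in batches:
        accumulate_hessian(hessian, batch)
    return hessian

# Hypothetical "validation activations": two batches of 2-dimensional inputs.
toy_batches = [
    [[1.0, 2.0], [0.0, 1.0]],
    [[3.0, -1.0]],
]
H = collect_hessian(toy_batches, dim=2)
print(H)
```

The only difference between calibration variants in this picture is which data produces the activation batches; the accumulation itself is unchanged.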
owizdom added a commit to owizdom/parameter-golf that referenced this pull request on Apr 9, 2026
…ib GPTQ + SLOT-24

Replaces the triple-stack (Pre-Quant TTT + Val-Calib GPTQ + Eval-Time Legal TTT) with a quad-stack that supersedes the legal TTT path with SLOT-24, ported from PR openai#1488 / PR openai#1313.

Four val-data adaptations stacked for the first time:

1. Pre-Quant AdamW TTT: 11 epochs, freeze_blocks=0 (Track A)
2. Val-Calibrated GPTQ: Hessian H=X^T X from val activations (Track A)
3. SLOT-24: per-window hidden delta + logit bias on the frozen post-quant model, 24 cosine-decayed AdamW steps, throwaway parameters
4. (Optional) Eval-Time Legal Score-First TTT: disabled by default; SLOT supersedes it within the eval budget. Set SLOT_ENABLED=0 TTT_ENABLED=1 to fall back.

Code changes vs the previous synthesis commit:

- GPT class: split forward_logits into forward_hidden + compute_logits so SLOT can add the per-window delta to the hidden state without re-running the transformer stack.
- New eval_val_slot function ported from PR openai#1488 (per-window AdamW with cosine LR decay, stride masking, score-after-delta).
- run_evals: wires SLOT on a fresh post-quant model copy, gated by SLOT_ENABLED. Disables legal TTT by default.
- New hyperparameters: SLOT_ENABLED, SLOT_STEPS, SLOT_LR, SLOT_LR_MIN, SLOT_BATCH_SEQS, SLOT_EVAL_STRIDE.

Folder renamed: 2026-04-09_PreQuantTTT11_ValCalibGPTQ_LegalEvalTTT_Synthesis -> 2026-04-09_PreQuantTTT11_ValCalibGPTQ_SLOT24_Quad_Synthesis

Time budget: ~530s of 600s eval used (590s train + 190s prequant TTT + 10s val-calib GPTQ + 80s sliding eval baseline + 250s SLOT-24).

Code: 2322 lines (vs 2039 in PR openai#1487 base, +283 added). py_compile clean. README rewritten as user's submission with compact credits section.
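The "24 cosine-decayed AdamW steps" can be pictured with a small schedule sketch. This assumes the common cosine form that starts at the maximum rate and ends at the minimum rate on the last step; the exact formula in eval_val_slot is not shown in the commit message. The function name `cosine_decay_lr` and both learning-rate values are placeholders, since the values of SLOT_LR and SLOT_LR_MIN are not given.

```python
import math

def cosine_decay_lr(step: int, total_steps: int, lr_max: float, lr_min: float) -> float:
    """Learning rate at a 0-based step, decaying from lr_max to lr_min."""
    if total_steps <= 1:
        return lr_max
    progress = step / (total_steps - 1)  # 0.0 at the first step, 1.0 at the last
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# 24 steps matches the SLOT-24 name; the two rates below are placeholders.
SLOT_STEPS = 24
schedule = [
    cosine_decay_lr(step, SLOT_STEPS, lr_max=1e-2, lr_min=1e-3)
    for step in range(SLOT_STEPS)
]

for step in (0, SLOT_STEPS // 2, SLOT_STEPS - 1):
    print(step, f"{schedule[step]:.6f}")
```

In a per-window setup like the one described, a schedule of this shape would be restarted for each window, because the adapted parameters are thrown away between windows.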
Author
Reopening this PR.

When it was submitted, I closed it after @dexhunter raised a valid concern about Condition 3 compliance of the pre-quant TTT pattern (training on val data before quantization). I agreed the interpretation was unclear and closed proactively.

Since then, PR #1517 has been submitted with the same pre-quant TTT approach (18 epochs). Reopening this PR pending official clarification on whether pre-quant TTT is legal under Issue #1017. If the ruling is that it violates Condition 3, I'll close again immediately.

Result: val_bpb 1.0600 (3-seed mean), same architectural stack as described in the original submission.
Record: SP8192 + Full Stack + Tuned Pre-Quant TTT
val_bpb = 1.0600 (3-seed mean, std 0.0002) | ~15.95 MB | 8xH100 SXM
3-Seed Results
What Changed vs PR #1485
Hyperparameter tuning on pre-quant TTT:
Same architecture, same code, different env vars. Delta: -0.0079 BPB.
Full Stack
SP8192, 11L/13 virtual (3-layer depth recurrence), parallel residuals (L7+), EMA 0.9965, QK-Gain 5.25, skip gates, MuonEq-R, pre-quant AdamW TTT (10ep, lr=0.00045, freeze 1, cosine), SDClip GPTQ int6 + int8 embed + brotli.
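As an illustration of the "EMA 0.9965" entry, the standard exponential-moving-average weight update looks roughly like this. It is a sketch under the assumption that the stack uses the usual form ema <- decay * ema + (1 - decay) * weights; the submission's actual EMA code is not shown here, and `ema_update` plus the scalar toy weights are hypothetical.

```python
from typing import Dict

EMA_DECAY = 0.9965  # decay value listed in the stack description

def ema_update(ema: Dict[str, float], params: Dict[str, float],
               decay: float = EMA_DECAY) -> None:
    """In-place EMA step: ema <- decay * ema + (1 - decay) * params."""
    for name, value in params.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * value

# Toy example with a scalar "weight"; a real model would hold tensors.
ema_weights = {"w": 1.0}
ema_update(ema_weights, {"w": 2.0})
print(ema_weights["w"])
```

With a decay this close to 1, each training step moves the averaged weights only 0.35% of the way toward the current weights.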
Compliance (Track A)
Reproduction
Credits
PR #1471 @X-Abhishek-X, PR #1423 @aryanbhosale, PR #1394 @clarkkev, PR #1204 @msisovic, PR #1482 @aamodbhatt
Checklist
records/track_10min_16mb/